DOMAIN: Telecom

CONTEXT:

A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. 
Analysing all relevant customer data makes it possible to develop focused customer retention programs.

DATA DESCRIPTION:

Each row represents a customer; each column contains a customer attribute described in the column metadata. The data set includes information about:

• Customers who left within the last month – the column is called Churn
• Services that each customer has signed up for – phone, multiple lines, internet, online security, 
online backup, device protection, tech support, and streaming TV and movies
• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly 
charges, and total charges
• Demographic info about customers – gender, age range, and if they have partners and dependents

PROJECT OBJECTIVE:

Build a model that will help to identify the potential customers who have a higher probability to churn. 
This helps the company understand the pain points and patterns of customer churn and sharpen its focus on customer retention strategy.

Import and warehouse data:

• Import all the given datasets from MySQL server. Explore shape and size. 
• Merge all datasets into one and explore final shape and size
Import all the given datasets from MySQL server.
There are only 10 columns in this dataset, so we need to concatenate the other dataset with this one.
Merge all datasets into one
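A minimal sketch of the merge step. It assumes the tables share a key column, here given the hypothetical name `customerID`, and uses two tiny in-memory frames in place of the MySQL tables. All column names and values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two source tables (illustrative values only).
services = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],   # assumed shared key
    "PhoneService": ["Yes", "No", "Yes"],
})
accounts = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

# Inner join on the shared key; validate raises if a key is duplicated.
merged = services.merge(accounts, on="customerID", how="inner",
                        validate="one_to_one")

print(merged.shape)  # rows, columns of the combined frame
```

Checking `merged.shape` against the shapes of the inputs is a quick way to confirm that no rows were lost or duplicated by the join.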
The provided data can be imported in two ways. The cell below imports the data from CSV and compares it with the data loaded from MySQL.
A result of True signifies that both data sets are equal, so the data import is fine.
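One way such a comparison could look, assuming both sources are loaded into pandas frames. The CSV text and the second frame below are made-up stand-ins, not the project data.

```python
import io
import pandas as pd

# Stand-in for the CSV file (illustrative values only).
csv_text = "tenure,SeniorCitizen\n1,0\n34,0\n2,1\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Stand-in for the frame loaded from the database.
from_db = pd.DataFrame({"tenure": [1, 34, 2], "SeniorCitizen": [0, 0, 1]})

# DataFrame.equals checks shape, values and dtypes in one call.
same = from_csv.equals(from_db)
print(same)
```

Because `equals` is also sensitive to dtypes and row order, two sources with the same values can still compare as unequal until types and ordering are aligned.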
Explore shape and size

Data cleansing

• Missing value treatment
• Convert categorical attributes to continuous using relevant functional knowledge
• Drop attribute/s if required using relevant functional knowledge
• Automate all the above steps
Missing value treatment
The task asks for finding and imputing missing values to be automated; the line below achieves that.
Wherever a column contains null values, they are replaced with the column mean. We loop through all columns that have NA values and replace them, so the same step runs automatically when new data arrives.
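The loop described above might be sketched as follows. The helper name `impute_missing_with_mean` is hypothetical, and restricting the fill to numeric columns is an assumption, since a mean is undefined for text columns.

```python
import numpy as np
import pandas as pd

def impute_missing_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which NaNs in numeric columns become the column mean."""
    out = df.copy()
    # Visit only the columns that actually contain missing values.
    for col in out.columns[out.isna().any()]:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
    return out

# Toy frame with one missing value (illustrative only).
demo = pd.DataFrame({"tenure": [1, 2, 3],
                     "TotalCharges": [10.0, np.nan, 30.0]})
clean = impute_missing_with_mean(demo)
print(clean)
```

Wrapping the loop in a function means the same call can be reused unchanged on a new batch of data.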
Convert categorical attributes to continuous using relevant functional knowledge
The columns above appear to be categorical. PaymentMethod and gender can be one-hot encoded; the rest can be converted using the dummies function, since their values can be graded as a service being enabled or not enabled.
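A small sketch of the encoding idea on toy rows. Mapping the yes/no service flag to 1/0 with `map` is an assumption made here for brevity; the notebook itself refers to a dummies function for those columns.

```python
import pandas as pd

# Toy rows (illustrative values only).
demo = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "PaymentMethod": ["Mailed check", "Electronic check", "Mailed check"],
    "PhoneService": ["Yes", "No", "Yes"],
})

# One-hot encode the nominal columns, which have no natural order.
encoded = pd.get_dummies(demo, columns=["gender", "PaymentMethod"], dtype=int)

# Grade a service flag as enabled (1) or not enabled (0).
encoded["PhoneService"] = encoded["PhoneService"].map({"Yes": 1, "No": 0})

print(encoded.dtypes)
```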
We observe a good correlation between MonthlyCharges and TotalCharges.
We observe a good correlation between Tenure and TotalCharges.
The existing continuous columns have a good correlation with the class variable.
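Correlations like these can be read off a Pearson correlation matrix. The sketch below uses made-up numbers in which TotalCharges is simply tenure times MonthlyCharges, so the values it prints say nothing about the real data.

```python
import pandas as pd

# Toy data (illustrative only); Churn is coded 1 = churned, 0 = stayed.
demo = pd.DataFrame({
    "tenure": [1, 12, 24, 48, 72],
    "MonthlyCharges": [30.0, 50.0, 70.0, 90.0, 110.0],
    "Churn": [1, 1, 0, 0, 0],
})
demo["TotalCharges"] = demo["tenure"] * demo["MonthlyCharges"]

# Pairwise Pearson correlation between all numeric columns.
corr = demo.corr()
print(corr.round(2))
```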
Drop all columns that the chi-square test determines to be unimportant, so the same step runs automatically when new data arrives.
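A hedged sketch of a chi-square filter using `scipy.stats.chi2_contingency`. The helper name `chi_square_irrelevant`, the 0.05 significance level, and the toy data are all assumptions for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_irrelevant(df, target, alpha=0.05):
    """List columns whose association with the target is not significant."""
    to_drop = []
    for col in df.columns.drop(target):
        table = pd.crosstab(df[col], df[target])  # contingency table
        p_value = chi2_contingency(table)[1]
        if p_value > alpha:                       # no evidence of association
            to_drop.append(col)
    return to_drop

# Toy data: Contract is strongly tied to Churn, gender is not.
demo = pd.DataFrame({
    "Contract": ["Monthly"] * 20 + ["Yearly"] * 20,
    "gender": ["F", "M"] * 20,
    "Churn": ["Yes"] * 18 + ["No"] * 2 + ["Yes"] * 2 + ["No"] * 18,
})

to_drop = chi_square_irrelevant(demo, "Churn")
reduced = demo.drop(columns=to_drop)
print(to_drop)
```

The test applies to categorical columns only; continuous columns would need binning or a different test before being filtered this way.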
All columns are now of numerical data type, as needed to build the model.

Data analysis & visualisation

• Perform detailed statistical analysis on the data.
• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
Perform detailed statistical analysis on the data.
For continuous variables, we use describe to learn more about each variable and plot a few graphs to understand behaviour patterns.

There are only 3 continuous variables in the dataset.

Tenure:

   75% of the data is below 55. No outliers are observed, since the max and the 75% value do not differ much.
   Since the min value is 0, we need to take a closer look at the data.

MonthlyCharges:

    75% of the data is below 89. No outliers are observed, since the max and the 75% value do not differ much.
    The mean and median differ: the mean is less than the median, which indicates left (negative) skewness in the data.

TotalCharges:

    75% of the data is below 3795. Since the max and the 75% value differ considerably, we need to check whether outliers exist in the data.
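One common way to check for outliers is the 1.5 × IQR rule. This is a swapped-in illustration, not necessarily the check used in the notebook; the helper name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Toy values with one obviously extreme entry (illustrative only).
demo = pd.Series([20, 25, 30, 35, 40, 45, 50, 500], name="TotalCharges")

summary = demo.describe()      # count, mean, std, min, quartiles, max
outliers = iqr_outliers(demo)
print(summary)
print(outliers)
```

An empty result from `iqr_outliers` would mean that a large gap between the max and the 75% value comes from a long tail rather than from isolated extreme points.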
Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
For categorical variables, bar graphs and count plots help to understand behaviour patterns.
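The numbers behind a count plot or a grouped bar graph can be computed directly with pandas. The sketch below uses a handful of made-up rows and a `Contract` column chosen only as an example of a categorical attribute.

```python
import pandas as pd

# Toy rows (illustrative values only).
demo = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "Two year", "Two year", "One year"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})

# Univariate view: how many customers fall in each category.
counts = demo["Contract"].value_counts()

# Bivariate view: share of churned and retained customers per category.
churn_rate = pd.crosstab(demo["Contract"], demo["Churn"], normalize="index")

print(counts)
print(churn_rate)
```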

Based on Tenure -

Based on Monthly Charges -

Based on Monthly Charges -

Based on Tenure -

Data pre-processing:

• Segregate predictors vs target attributes
• Check for target balancing and fix it if found imbalanced.
• Perform train-test split.
• Check if the train and test data have similar statistical characteristics when compared with original data
Segregate predictors vs target attributes
Check for target balancing and fix it if found imbalanced.
Perform train-test split.
Check if the train and test data have similar statistical characteristics when compared with original data.
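A compact sketch of these four steps on synthetic data, using scikit-learn. Two choices here are assumptions rather than the notebook's method: balancing is done by random oversampling with `sklearn.utils.resample`, and it is applied only to the training portion after the split, so that duplicated rows cannot appear in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced data (illustrative only): 50 churners out of 200.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=200),
    "MonthlyCharges": rng.uniform(18, 120, size=200).round(2),
    "Churn": [1] * 50 + [0] * 150,
})

# 1. Segregate predictors and target.
X = df.drop(columns="Churn")
y = df["Churn"]

# 2. Stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 3. Oversample the minority class in the training data only.
train = X_train.assign(Churn=y_train)
minority = train[train["Churn"] == 1]
majority = train[train["Churn"] == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

# 4. Compare basic statistics of the splits with the original data.
comparison = pd.DataFrame({
    "original": X.mean(),
    "train": X_train.mean(),
    "test": X_test.mean(),
})
print(train_balanced["Churn"].value_counts())
print(comparison)
```

Means that sit close together across the three columns of `comparison` suggest the split did not distort the distributions; `describe()` could be compared in the same way for a fuller picture.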
Model training, testing and tuning:
Different bagging-based classifier methods.
AdaBoost
Gradient Boosting classifier
XGBoost
Random Forest
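A minimal comparison loop over the scikit-learn ensemble classifiers named above. It assumes default hyperparameters and uses synthetic data from `make_classification` instead of the churn data. XGBoost is left out to avoid an extra dependency; `xgboost.XGBClassifier` offers the same `fit`/`predict` interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the prepared features).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Bagging": BaggingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

For a churn problem with imbalanced classes, recall or F1 on the churn class is often more informative than accuracy; `sklearn.metrics.classification_report` reports both.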

GUI development

The print statement helps us view the input selected by the user.

Conclusion and improvisation:

Conclusion:
Improvement: